Data Type Detection for Choosing an Appropriate Correlation Coefficient in the Bivariate Case
Abstract
Data scientists usually determine the data type from the nature of the variables and then select an appropriate correlation measure. However, this is inconvenient and time-consuming in data-intensive domains. I propose to detect the types of the variables automatically and choose the appropriate correlation coefficient, so that the statistical procedure of estimating correlations from mixed data can be automated. This should reduce the time spent on correlation analysis and increase the accuracy of the estimated correlation coefficients. The continuity index is used to detect whether a variable is continuous or ordered categorical. Based on a simulation study, I have estimated the cutoff level of the continuity index for choosing among the Pearson, polychoric, and polyserial correlation coefficients.
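The selection rule described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define the continuity index, so the sketch uses the share of distinct values in the sample as a hypothetical stand-in, and the cutoff of 0.5 is a placeholder rather than the value estimated in the simulation study. The function names `continuity_index` and `choose_coefficient` are likewise invented for this example. The mapping itself follows the standard use of the three coefficients: Pearson for two continuous variables, polychoric for two ordered categorical variables, and polyserial for one of each.

```python
from typing import Sequence


def continuity_index(values: Sequence[float]) -> float:
    """Hypothetical proxy for the continuity index (an assumption, not the
    paper's definition): the share of distinct values among all observations."""
    if len(values) == 0:
        raise ValueError("empty sample")
    return len(set(values)) / len(values)


def choose_coefficient(x: Sequence[float], y: Sequence[float],
                       cutoff: float = 0.5) -> str:
    """Return the name of the coefficient suggested for the pair (x, y).

    The cutoff is a placeholder; the paper estimates its own value by simulation.
    """
    x_continuous = continuity_index(x) >= cutoff
    y_continuous = continuity_index(y) >= cutoff
    if x_continuous and y_continuous:
        return "pearson"      # both variables treated as continuous
    if not x_continuous and not y_continuous:
        return "polychoric"   # both variables treated as ordered categorical
    return "polyserial"       # one continuous, one ordered categorical


# Usage: a variable with all-distinct values paired with a 5-level rating.
measurements = [float(i) for i in range(100)]
ratings = [i % 5 for i in range(100)]
print(choose_coefficient(measurements, ratings))
```

The function only returns the name of the suggested coefficient. Computing the polychoric or polyserial estimate would require a separate statistical routine, which is outside the scope of this sketch.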
Similar resources
Test of the Correlation Coefficient in Bivariate Normal Populations Using Ranked Set Sampling
Ranked Set Sampling (RSS) is a statistical method for data collection that leads to more efficient estimators than competitors based on Simple Random Sampling (SRS). We consider testing the correlation coefficient of bivariate normal distribution based on Bivariate RSS (BVRSS). Under one-sided and two-sided alternatives, we show that the new tests based on BVRSS are more powerful than the usua...
Cauchy Regression and Confidence Intervals for the Slope
This paper uses computer simulations to verify several features of the Greatest Deviation (GD) nonparametric correlation coefficient. First, its asymptotic distribution is used in a simple linear regression setting where both variables are bivariate. Second, the distribution free property of GD is demonstrated using both the bivariate normal and bivariate Cauchy distributions. Third, the robust...
Estimation of Count Data using Bivariate Negative Binomial Regression Models
The negative binomial regression model (NBR) is a popular approach for modeling overdispersed count data with covariates. Several parameterizations have been performed for NBR, and the two well-known models, negative binomial-1 regression model (NBR-1) and negative binomial-2 regression model (NBR-2), have been applied. Another parameterization of NBR is negative binomial-P regression mode...
A blended model for estimating of missing precipitation data (Case study of Tehran - Mehrabad station)
Meteorological stations usually contain some missing data for different reasons. Several traditional methods exist for completing the data; among them, bivariate and multivariate linear and non-linear correlation analysis, the double mass curve, ratio and difference methods, moving averages, and probability density functions are commonly used. In this paper a blended model comprising the bivariate expo...
Stochastic simulation of bivariate gamma distribution: a frequency-factor based approach
A frequency-factor based approach for stochastic simulation of bivariate gamma distribution is proposed. The approach involves generation of bivariate normal samples with a correlation coefficient consistent with the correlation coefficient of the corresponding bivariate gamma samples. Then the bivariate normal samples are transformed to bivariate gamma samples using the well-known general equa...